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Abstract. — One of the central tasks in evolutionary biology is to reconstruct the 
evolutionary relationships among species from sequence data, particularly from multilocus 
data. In the last ten years, many methods have been proposed to use the variance in the 
gene histories to estimate species trees by explicitly modeling deep coalescence. However, 
gene flow, another process that may produce gene history variance, has been less studied. 
In this paper, we propose a simple yet innovative method for species trees estimation in the 
presence of gene flow. Our method, called STEST (Species Tree Estimation from 
Speciation Times), constructs species tree estimates from pairwise speciation time or 
species divergence time estimates. By using methods that estimate speciation times in the 



presence of gene flow, (for example, Ml (Yang 2010) or SIM3s (Zhu and Yang 2012)), 
STEST is able to estimate species trees from data subject to gene flow. We develop two 
methods, called STEST (Ml) and STEST (SIM3s), for this purpose. Additionally, we 



consider the method STEST (MO), which instead uses the MO method (Yang 2002), a 
coalescent-based method that does not assume gene flow, to estimate speciation times. It is 
therefore devised to estimate species trees in the absence of gene flow. Our simulation 
studies show that STEST (MO) outperforms STEST(Ml), STEST (SIM3s) and STEM in 
terms of estimation accuracy and outperfroms *BEAST in terms of running time when the 
degree of gene flow is small. STEST (Ml) outperforms STEST (MO), STEST (SIM3s), 
STEM and *BEAST in term of estimation accuracy when the degree of gene flow is large. 
An empirical data set analyzed by these methods gives species tree estimates that are 
consistent with the previous results. 

(Keywords: species tree estimation, speciation time, gene flow, migration, coalescent, 
multilocus data) 
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Introduction 

Species tree estimation is one of the most fundamental problems in evolutionary 
biology. Gene trees estimated from sequences sampled from the corresponding species were 
once treated as the species tree estimate before adavances in sequencing techniques made 
multilocus data available for routine phylogenetic analysis. Now, it is well-appreciated that 
an embedded gene tree may not match its underlying species tree, i.e., a gene may have a 



different evolutionary history from its underlying species (Fitch 1970; Tajima 1983 Pamilo 



and Nei|l988| |Felsenstein|[2004l ) . The causes for such incongruence include deep coalescence, 
horizontal gene transfer (HGT) or lateral gene transfer (LGT), and gene duplication/loss 
(see Fig. [I]). Deep coalescence, also called incomplete lineage sorting (ILS), refers to the 
case when the coalescent time is deeper into the past than the previous speciation time 
(Fig. [Ik), which might result from large population sizes or short speciation times 



(Maddison 1997). HGT or LGT refers to gene flow between organisms that is not through 



reproduction (Fig. [Tjo). The probability of HGT between distinct species is different. It is 
widely accepted that in prokaryotes, HGT happens frequently, thus playing an important 
role in evolution (Boto 2010). In addition, more evidence for HGT is being found in other 



cases, such as HGT between Bacteria and Eukarya (Watkins and Gary 2006 Guljamow 



et al. 2007), and within Eukayrya (Nedelcu et al. 2008). Gene duplication is also an 



important mechanism in molecular evolution. It usually refers to the duplication of regions 
of DNA that contain at least one gene (Fig. [lb- A). Gene loss is the loss of DNA sequences 
of genes (Fig.[l]>B). There are many factors that could lead to gene loss, such as unequal 
crossing over and losses from translocation. Both gene duplication and gene loss are very 



common and their mechanisms have been extensively studied (Dittmar and Liberies 2011) 



Coalescent Theory and Deep Coalescence 
The best-studied of these processes is deep coalescence, mainly because Kingman's 
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coalescent theory, a continuous-time retrospective model in which the genes sampled in 
individuals can be traced back to a common ancestor known as the most recent common 
ancestor (MRCA), provides a way to model the in-pop ulation coalescent processes and to 
link them on a phylogenetic tree. To illustrate this idea, we consider a three-taxon species 
tree S =((1,2), 3) with only one lineage sampled from each population. There are three 
possible gene tree topologies ((1,2), 3), ((2,3),1) and ((1,3), 2), with four possible gene 
histories H a , Hf,, H c , and Hd (see Figs. [2^i~d). According to Kingman ( 1982a[b ), in a 



population with 9 = ANfi, where N is the effective population size and /i is the mutation 
rate, the random variable T, defined to be the time for n lineages to coalesce into n — 1 
lineages, follows an exponential distribution with the parameter (™) f • I n our case, let T be 
the time to the coalescent event of the two lineages sampled from population 1 and 
population 2 (see Fig. |2^i). Assume 2/9 12 = 1 so that T ~ Exp(l) in the population 12. Let 
t be the time interval between the two population divergence events. Then the probability 
of the gene history H a is equal to the probability that T is smaller than or equal to t. So 
the probability of gene history H a can be expressed as a function of t, 



P(H a \S) = P(T <t)= [ e~ x dx = 1 

Jo 



Since mating is random in the population 123, Gb, G c and Gd all have the same probability 
(Figs.@b,c,d), 

P(H h \S) = P(H C \S) = P(H d \S) = i(l - P(H a \S)) = ie"*. (2) 
Therefore, the probability of gene tree G given species tree S is 



P(G = ((1,2), 3)\S) = P(H a \S) + P(H b \S) = 1 - |e-*, 



(3) 
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P(G = ((2,3), 1)\S) = P(H C \S) = -e' 



(4) 



P(G=((l,3),2)\S) = P(H d \S)= 1 -e- t . 



(5) 



Equations [3j |4j and [5] can be used to calculate the distribution of gene trees G given a 
species tree S. On the other hand, species trees can be estimated by examing the gene tree 
distribution. Based on similar ideas, a number of methods have been proposed to estimate 
species trees with the assumption that the conflict between gene trees and species trees are 
due solely to deep coalescence. 



*BEAST (Drummond and Rambaut 2007 Drummond et al. 2012) and BEST 



(Bayesian Estimation of Species Trees Under the Coalescent Model) |Liu 2012) are two 
widely-used Bayesian inference programs. *BEAST uses Makov Chain Monte Carlo 
(MCMC) to jointly estimate the posterior distribution of the target species tree as well as 
all the gene trees and other population parameters such as mutation rates and population 
sizes. BEST deploys a hierarchical MCMC approach for the same purpose. The program 



STEM (Species Tree Estimation Using Maximum Likelihood) (Kubatko et al. 2009) takes a 



set of gene trees as the input and returns the maximum tree (MT) as an estimate of their 
underlying species tree. Liu (2006; see also Liu and Pearl, 2010) has shown that MT is 
statistically consistent if the gene trees are known. This method was also developed 
independently by Mossel and Roch (2010) under the name GLASS (Global LAteSt Split). 
STEM can be viewed both as a maximum likelihood method and as a distance method, 
since MT is a maximum likelihood estimate under suitable conditions but it can be built 
from a distance matrix where each entry is the smallest coalescent time of genes from every 
pair of species. 
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The IM Model and Gene Flow 

There are more phylogenetic inference programs that model deep coalescence than 
those listed here (See Felsenstein's website 



http : // evolution . genetics . Washington . edu/phylip/sof tware . html/ for a more 



complete list). However, how to appropriately model gene flow for species tree estimation 



has been less well-studied and still remains a big challenge. Maddison (1997) described the 



gene tree parsimony method that picks the species tree with the minimal number of 



migration events, and then a decade later, Eckert and Carstens (2008) and Leache et al. 



(2014) examined the accuracy of species tree estimates from simulated data subject to gene 
flow for several of the existing species tree estimation methods, none of which models the 
process of gene flow. They concluded that the existence of migration may complicate the 



phylogenetic inference problem in many situations. Kutschera et al. (2014) also confirmed 
this conclusion in an empirical data study. 

The IM (Isolation-with-Migration) model, which can be used to calculate the 
probability density of coalescent times in the presence of gene flow, may be the key tool to 
solve the problem. To introduce the IM model, we consider a two-population IM model 
that involves six parameters, 9i, 9 2 , 9a, t , m i2 and m 2 i (Fig. [3]). We define 
9{ = 4iVj/i, i G {1, 2, A} where iVj is the effective population size for the corresponding 
population i and /i is the mutation rate per generation, r is the speciation time (the length 
of the time interval from the time of speciation to the present). We further define 
rriij = Mij/fi, where My is the migration rate from population i to population j per 
generation. Assume that one lineage is sampled from each population. Let the state SVjj) 
indicate i genes in population 1 and j genes in population 2. We can enumerate all the 
possible states before time r (if not specifically mentioned time is always viewed from 
present to past throughout the text): S(ia) (also the initial state), 5(2,0), S (0,2)1 S(i,o)i >%>,i)- 
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To formulate an instantaneous rate matrix, the transition rate between every pair of states 
needs to be calculated. For example, Figure [3]>A illustrates the transition from state £(1,1) 
to state 5(2,0) through a migration event from population 2 to population 1, which has rate 
777,21. Figure |3Jd-B is the transition from state 5(2,0) to state £(1,0) through a coalescent 
event in population 1, which has a rate of 2/9\. Figure [3]>C is the transition from state 
5(1,0) to state 5(o,i) through a migration event, which has rate 77112. The instantaneous rate 
matrix Q is given below: 



Q 





5(1,1) 


5(2,0) 


5(0,2) 


5(1,0) 


5(o,i) 


5(1,1) 








0 


0 


TO21 


mi2 


5(2,0) 


2mi2 




0 


2 


0 


5(0,2) 


2m 2 i 


0 




0 


2 

01 


5(1,0) 


0 


0 


0 




mi2 


5(o,i) 


( 0 


0 


0 


772,21 





(6) 



The diagonal entries are filled in so that the sum of each row is zero. After time r, there 
are 2 possible cases: 

I. There is only one lineage in the ancestral population, which means the coalescent 
event has happened before time r. Therefore, the state at time r could be either 
5(o,i) or 5(1,0). 

II. There are two lineages at time r in the ancestral population. The state at time r 



could be 5, 



5(i,i) or 5(2,o)- 



(0,2), 0(1,1) 



Following Hobolth et al. (2011), the continuous-time Markov chain representation can be 
used to get the matrix of probabilities of transitions between the states as a function of 
time. This transition probability matrix is obtained as the solution P(t) to the system of 
differential equations P'(t) = QP(i) with initial condition P(0) = I. The solution is 
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P(t) = e^*, which we use to derive the probability density function for the two cases listed 
above. Let t be the time to the coalescent event. 

Case I: Note that this case corresponds to t < r, and we must consider two possibilities: 
(1) If the coalescent event occurs in population 1, the density for time t< r is 

2 

f(S(i,x),S (m )(t) = ( eQt )(S CM) ,S (li0) )(^-), ( 7 ) 

where (e Qt )(s (i j)t s^ t) ) is the entry (a, b) in the matrix e Qt if S^j) is in the a th row and S^ S) t) 
is in the h th column. 

(2) If the coalescent event occurs in population 2, the density for time t< r is 

2 

/(s ( i,i),S(o,i))(*) = (e t )(s (lil) ,s (0 , 1) )(^)- (8) 
Case II: If the coalescent event occurs after time r, the density for t > r is 



/(s (lil)1 s (y)i+ . =2 )(t) = 5^ (e Qr )(s (1 , 1)) s (iJ) )^2->i(*-r), 

i+j=2 



(9) 



where g n ^i(y) is the probability density function for n genes to coalesce to 1 gene in time 



y. This probability is well-known (Tavare 1984 Takahata and Nei 1985; Wakeley 2009 



Efromovich and Kubatko 



2008). The special case ^2—^1 (2/) = j~ e $A follows from basic 



coalescent theory. 

Equations [7|- [9] can be used to derive formulas to calculate the distribution of gene 
trees given a species tree with the presence of gene flow in the same way that we have 
previously. Questions arise whether a similar approach is appliable to estimate species trees 
in the presence of gene flow. At least so far, such a phylogenetic inference program hasn't 
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been developed. Nontheless, the IM model has already been widely used in demographic 



parameter estimation. Hey and Nielsen (2004) developed a Bayesian program called IM 



under an IM model, which is the first software to jointly estimate speciation times, 
population sizes and migration rates. The upgraded versions are IMa (2007) and IMa2 
(2010). |Zhu and Yang (2012) also developed an IM-model-based likelihood method to 



jointly estimate speciation times, population sizes and migration rates. All of these 
methods assume that the correct species tree topology is known. In order to study 
populations with gene flow, many researchers first obtain a species tree estimate by a 
species tree estimation program that doesn't allow the possibility of gene flow. Then they 
treat this species tree estimate as the correct phylogeny and use a demographic parameter 
inference program to evaluate the magnitude of migration. However, they are risking the 
chance that errors in the species tree estimation may also collapse the demographic 
parameter estimation. 

In this paper, we propose a distance method called STEST (Species Tree 
Estimation from Speciation Times) to estimate species trees in the presence of gene flow. 
The idea is to use pairwise speciation time or species divergence time estimates as 
distances to construct a species tree. Species tree estimation error is not involved in a 
two-species case, where there is only one possible species tree topology. Therefore pairwise 
speciation times are first estimated by a speciation time estimation method that assumes 
the possbility of gene flow. Then a sequential clustering algorithm is applied to construct 
the species tree. Despite the fact that the motivation to develop this method is to 
accomodate gene flow, we also evaluate the performance of our method using a speciation 
time estimation method that assumes no gene flow. 
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Methods 

Our method STEST consists of two parts: creation of a distance matrix and use of 
a clustering algorithm based on this matrix to construct the tree. We will describe each of 
these steps in detail in the following sections. 



Speciation Time Estimation Methods 

To create a distance matrix, a speciation time estimation method needs to be picked 
first. Here, we prefer maximum likelihood methods over Bayesian methods for their short 



computation time. Three different likelihood methods are considered: Yang (2002)'s 



method MO, Yang (2010 )'s method Ml, and Zhu and Yang (2012)'s method SIM3s. Ml 



and SIM3s allow the possbility of gene flow while MO does not. All three estimate the 
parameters of interest by searching a point in the parameter space that maximizes the 
likelihood function. The following is a brief description of the formulation of their 
likelihood functions. 



SIMSs. — We start with the method SIM3s because the other two methods can be 
illustrated under the SIM3s model's setting (see Fig. |4^i): there are three populations 1, 2 
and 3 (outgroup); r 0 and n are speciation times; 9i = 4iVfc/i {k = 1, 2, 3, 12, 123) are the 
population size parameters with iV's being the effective population sizes and \i the 
mutation rate per site; gene flow is assumed to exist between population 1 and population 
2 with migration rates m\ 2 and m 21 ; assume that Q\ = 9 2 = 9 and m 12 = m 21 = m, and 
that one lineage is sampled from each species. By convention, 6 = (9, # 12 , 6 I 12 3, m, r 0 , ri) is 
the parameter vector. Under this setting, there are 6 possible gene histories, Hi a , H u , H lc) 
H ld , H 2 and H 3 (Figs. |I]D~g). That gene flow only exists between population 1 and 
population 2 before time T\ fits a two-population IM model. By setting m 12 = m 21 = m, 
9i = 9 2 = 9, r = T\ and 9a = #12, formulas [6] - [9] in the introduction can be used. Let ti, to 
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be the times to the first and to the second coalescent events, respectively. We can derive 
the probability density /(ti, t 0 , H\Q) of coalescent times t 1: t 0 and gene history H given 9 
(for convenience, we write f(ti,to,H) instead in the cases where no ambiguity arises). 



For gene history H la (Fig. |4p), t\ < Ti, the first coalescent event occurs in 
population 1 and the second coalescent event occurs in population 123, so 



f(tx,t 0 ,H la ) = / ( s CM) ,5 (li0) )(ti) x -^e- 2{t °- To),e ™. (10) 



For gene history (Fig. |4p), t\ < tl, the first coalescent event occurs in 
population 2 and the second coalescent event occurs in population 123, so 

fih^Hu) = / ( s (M) ,s (0jl)) (*i) x A- e -2(*o-r 0 )/* 123 . 



12:! 



For gene history H\ c (Fig. |4p), n < t\ < tq, the first coalescent event occurs in 
population 12 and the second coalescent event occurs in population 123, so 



m,t 0 ,H lc ) = /( %il)>% , )i+ ^)(ri) x ^e- 2( ^ l)/ei2 x ^ e -^)M» (12) 



For gene history Hid, H 2 and H 3 (Figs. |4^,f,g), ti > r 0 and both coalescent events 
occurs in population 123, so 

f(t 1 ,t 0 ,H ld ) = f(t 1 ,t 0 ,H 2 ) = f(ti,t 0 ,H 3 ) = -x(l-f(t 1 ,t 0 ,H la )-f(ti,t 0 ,H lb )-f(t 1 ,t 0 ,Hi c )). 

(13) 

Suppose the sequences at each locus are aligned and no gaps exist. At each site, 
there are five possible patterns: QljQljQljj 't]C'*c\jtj yoooOy ^cijoc and xyz, where x, y } z are symbols for 
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different nucleotides. At any locus i, the sequence alignments are first summarized into site 
pattern counts Di = {^r/}j=i,2,3,4,5, where is the number of the j th site pattern observed. 
Let D = {-Dj}™ be the data for n unlinked loci and assume the JC69 mutation model 



(Jukes and Cantor 1969). Equations 10-13 and Yang (1994)'s formula for the conditional 



probability P(Di\ti,t 0 , H) of Di given ti, ti and H are then combined together to derive 
the likelihood function of G given D, 



L(e\D) = l[p(D l \e), 



i=l 



where 



P(D l \&)= I I P(D,\H,.t,.t l )f(t u .t [ .H,)dt i] dt l . 

fee{la,lb,lc,ld,2,3} ' 



(14) 



(15) 



MO. — The method MO adopts a reduced model of SIM3s, which assumes no gene flow. 
Under this setting, there are only 4 possible gene histories: Hi c , H ld , H 2 and H 3 (Fig. [4 
This is also the case for the coalescent without migration (see Fig. [2]). The likelihood 



function is the same as Equation 14 with m = 0. Let /o(ti, t 0 , H) denote the probability 



density of t\, t 0 and H in this special case when m = 0. Then by Equations 10- 13 



fo(ti, t 0 , Hia) — f 0 (ti,to,H u ) — 0, 



(16) 



^12 



(17) 



7 123 



/o(*i, t 0 , Hid) = fo(h, t 0 , H 2 ) = fo(ti,t 0 , H 3 ) — - x (1 — /o(ti, t 0 , H lc )). (It 
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The likelihood function therefore can be written as 



L(e 0 \D) = l[P(D t \Q 0 ), (19) 



i=l 



where 

F(A|e„) = If F(A|fl"fc,*o,ti)/o(to,ti,^fc)dtodti, (20) 

fce{lc,ld,2,3} 

and where the parameter vector 0 O = (#12, #123, Ti, to) because # does not affect the 
likelihood value, thus is unidentifiable. 

Ml. — The difference between Ml and MO is that Ml allows the species divergence time T\ 



of species 1 and species 2 to vary among loci at random due to possible gene flow. Yang 



(2010) chooses a beta distribution to model this. The density of t% is 



f{rx\r 0 ,p,q) = -^(^^(l - ^)^-, 0 < n < r Q , (21) 

where tq, p, and q are prameters of the distribution. He then changes variables by making 
x\ = ^. Then x\ ~ beta(p, q), 0 < x\ < 1. The mean X\ of x\ is and the variance is 
( p+g )2^p +g+1 ) ) so V — ~\Z^q- Treating x\ and q as the parameters of the distribution of x%, 
the density of X\ can be written as f{x\\xi, q). The likelihood function is 



L{Q\D) = \[P{D % \® 1 ) 1 (22) 

i=i 

where 



P(A|©i) = / ( 1 1 p { D i\ H kMM)U{toM,H k \e m 9i2ZiTa,Ti = T 0 x 1 )dt 0 dt 1 jf(x 1 \x 1 ,q)dx 1 . 

(23) 
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and where the parameter vector 0! = (0 12 , #123,^1, <?, t 0 ). The estimate of the speciation 
time T\ is T\ = XiTq. 

We denote our method STEST (SIM3s) if SIM3s is used to estimate speciation 
times. STEST (MO) and STEST (Ml) are defined similarly. Once a speciation time 
estimation method is picked, the distance matrix can be built easily. 

Distance Matrix Building 

Let f2o={Si}™ =3 {n > 3) be a set of species. Let So be the outgroup. Within each 
species Si (0 < i < n), multiple genes $ are sampled. For each pair of species (Si, Sj), 
0 < i 7^ j, (gi, 9j,9o) are used to estimate the speciation time between species Si and Sj 
using one of the methods MO, Ml or SIM3s. We define D = (tij) to be the distance matrix, 
which is a symmetric nxn matrix that contains the speciation times for all pairs of species. 

Species Tree Reconstruction 

Let T 0 = {tij}i<:j be the set of all of the entries of the lower triangular part of the 
distance matrix D, i.e., distance between every pair of species. The following algorithm is 
performed: 

1. Pick the smallest time ti 1 j 1 in T 0 , write T\ = ti l ,j 1 , add a new node at time 
7~i to connect S^ and S^. 

2. Suppose k nodes have been added and a set Q C f2o of species has been 
connected on the tree. Pick the smallest time U j b among the remaining times. 

Case 1. If {S ia , Sj b } Pi O = 0, add a new node at time r fe+1 = t ia j b 
connecting S ia and Sj b . 

Case 2. If {S ia , Sj b } D O = {Sj b }, then add a new node at time 
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r fc+i = ti a ,j b connecting S ia and the node at T mbk , where r mftfc is the 
largest time at which the node is connected to Sj b after k nodes have 
been added. Similarly for the case in which {S ia , Sj b } HQ = {Sj a }. 

Case 3. If {S ia , Sj b } C Q, then (i) if S ia and S ia share an ancestor, 
then discard the time U a j b , this step is finished; (ii) if St a and St a 
don't share an ancestor, add a new node at time Tk+i = t% a ,j b to 
connect the nodes at T ma k and r mb k . 

3. Continue until all species share a common ancestor, i.e. the root is reached. 

An Example 

To illustrate this method, we consider a set S—{Si, S2, S3, S4, S5} consisting of 5 
species. Let D=(tij) be the distance matrix, for example, 





Si 


So 


S3 


Si 


s. 


\ 


Si 


( 0 


0.5 


1 


2.1 


2.2 


S2 


0.5 


0 


1.2 


2.3 


2.2 




S3 


1 


1.2 


0 


2 


2.1 




S4 


2.1 


2.3 


2 


0 


0.75 




s 5 


v 2 ' 2 


2.2 


2.1 


0.75 


0 


/ 



Then T 0 = {tij}i<,-={0.5, 0.75, 1, 1.2, 2, 2.1, 2.2, 2.3}. We perform the clustering 
algorithm step by step (see Fig. [5]) . 

1. Pick the sandiest element in T 0 , t\ .2=0.5. Add a new node at time 
r i=^i,2=0.5 to connect Si and S 2 - After this step, f2={Si, S 2 }, T = T 0 — {0.5} . 

2. Pick the smallest element in T, t 45 =0.75. Since {S^, Sj 5 } fl Q = 0, add a 
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new node at time r 2 =t 4j5 to connect S 4 and S 5 . After this step, 
n={S!, S 2 , S 4 , S 5 }, T = T 0 - {0.5, 0.75}. 

3. Pick the smallest element in T, t 13 =l, then {Si, S 3 } PI O = {Si}, and 

r mi 2 = T\ . Add a new node at time T3 to connect S3 and the node at T\ . After 
this step, f2={Si, S 2 , S 3 , S 4 , S 5 }, T = T 0 — {0.5, 0.75, 1}. 

4. Pick the smallest element in T, £2,3=!, but {S 2 , S 3 } C Q and S 2 & S 3 share a 
common ancestor at r 2 so nothing is done. After this step, 

Q={S h S 2 , S 3 , S 4 , S 5 }, T = T 0 - {0.5, 0.75, 1, 1.2}. 

5. Pick the smallest element in T, £3,4=2. Since {S3, S 4 } C Q and S3 and S 4 do 
not share a common ancestor, we need to add a new node to connect the nodes 
at r ma3 and r m43 , i.e., nodes at r 3 and r 2 . After this step, 

Q={S h S 2 , S 3 , S 4 , S 5 }, T = T 0 - {0.5, 0.75, 1, 1.2, 2}. 

6. Root is reached! 

This algorithm can be easily implemented in R. We analyze both simulated data 
and empirical data to evaluate the performance of our methods. 

Simulation Study 

Simulation Study 1: Four-taxon Tree Under the n-island Model. — Figure [6] shows the 
model species tree and parameters for the first simulation study. 100 genes are sampled 
with one lineage sampled from each species under a four-taxon species tree. Since the 
methods M0, Ml and SIM3s all require an outgroup, we specify species 0 to be the 
outgroup. All of the population size parameters are assumed to be the same and equal to 
6 = 4. Three bifurcating speciation events happen at times t±, t 2 , and T3. Gene flow exists 
among all but population 0 before t% with being the migration rate from population % 
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to population j (1 < i,j < 3). All the migration rates are equal. Thus the migration 



pattern follows an n-island model (Wright 1943). 



To simulate the data, gene trees are first sampled from ms (Hudson 2002) under 18 



different settings (labeled by Al~A9 and B1~B9, see Table [T]). Seq-Gen (Rambaut and 



Grassly 1997) is then used to generate full sequence data from the simulated gene trees 



under the JC69 model (Jukes and Cantor 1969). The length of each gene is set to be 1, 000 



bp. STEST (MO), STEST (Ml), STEST (SIM3s) and *BEAST (for 8 settings as indicated 
m Table Q are used to get species tree estimates directly from sequence data. Gene trees 



are first estimated by PAUP* (Swofford 2002) using maximum likelihood (ML) and are 
then used as the input to STEM. For each setting, the same procedure is repeated 80 times. 



Simulation Study 2: Nine-taxon Tree Under the n-island Model. — Figure [7] shows the 
model species tree and parameters for the second simulation study. 100 genes are sampled 
with one lineage sampled from each species under a nine-taxon species tree. We specify 
species 0 to be the outgroup. All of the population size parameters are assumed to be the 
same and equal to 9 = 4. Eight bifurcating speciation events happen at times r 1; r 2 , r 3 , r 4 , 
T5, and r 6 . Gene flow exists among all but population 0 before T\ with the migration 
rate from population i to population j (1 < i, j < 3). All of the migration rates are 
assumed to be equal. Thus the migration pattern again follows an n-island model. 

Just as in the previous section, gene trees are sampled from ms and then Seq-Gen is 
used to generate sequence data from these simulated gene trees under the JC69 model. 
The length of the simulated sequences is 1,000 bp. STEST (M0), STEST (Ml) and 
STEST (SIM3s) are applied to the sequence data directly. PAUP* is used to estimate ML 
gene trees from sequence data before STEM is applied. The 18 parameter settings C1~C9 
and D1~D9 are listed in Table [2} We use 100 replicates for each setting. 
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Empirical Study 

We apply the methods STEST (SIM3s), STEST (MO) and STEST (Ml) to the 
HGCOR (Human, Chimpanzee, Gorilla, Orangutan and Rhesus) data set obtained by 



Ebersberger et al.| (2007), who have shown that the species tree topology is 
((((G,0),C),H),R). R (Rhesus) is the outgroup. The data set contains 28, 160 sequence 
alignments. 249 of sequence alignments are longer than 1, 000 bp and are used in this 
analysis. This data set is of interest for analysis with these methods, since Zhu and Yang 
(2012) recently reported gene flow following speciation among some of these taxa. 

Results 



Results for Simulation Study 1 

Results from the simulation study 1 are given in Tables [3] and [4] and are plotted in 
Figure [8j For each setting, the number of correct tree estimates (out of 80) is recorded and 
translated into a percentage. 

A1^A9. — In the short speciation interval scenarios, *BEAST and STEST (M0) have very 
similar performance (percentage of correct estimates > 85%), which is better than all of 
the other methods in all cases. When gene flow doesn't exists, STEM estimates 85% of the 
total trees correctly, which is close to STEST (Ml)'s 86% correct, and better than STEST 
(SIM3s)'s 71% correct. In the presence of gene flow, the STEST methods consistently yield 
better results than STEM. The performance of all of the methods decreases as the 
migration rate increases. Particularly, STEM's estimation accuracy drops more dramtically 
than all of the other methods (decreases in accuracy from 85% to 46% when the migration 
rate changes from 0 to 0.025). The accuracy curves of *BEAST and STEST (M0) are 
almost flat and remain in a high level (percentage of correct estimates > 97%) when the 
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migration rate is not larger than 0.10. Then they start to drop slowly as the migration rate 
increases but are still above 85% when the migration rate is increased to 0.20. The 
performance of STEST (Ml) and STEST (SIM3s) follows a similar trend. STEST (Ml) 
performs better than STEST (SIM3s) with a ~10% difference in accuracy and STEST 
(M0) is better than STEST (Ml) with a -15% difference. 

B1~B9.- 

In the long speciation interval scenarios, all the methods peform well when there is 
no gene flow (percentage of correct estimates > 94%), and start to perform worse as the 
migration rates increases. Again, STEM's performance decreases dramatically as the 
migration rate increases (decreases in accuracy from 100% to 46% when the migration rate 
changes from 0 to 0.025). Similar thing happens to *BEAST and STEST (M0) with the 
steepest drop in their performance occuring when the migration rate changes from 0.05 to 
0.10. STEST (M0) performs better than *BEAST when the migration rate is smaller than 
0.05 and the opposite happens when the migration rate is larger than 0.05. STEST (Ml) 
and STEST (SIM3s) outperform all the other methods under every setting. Their 
performance curves behave similarly to each other, which are almost flat and remian in a 
high level (percentage of correct estimates > 90%) until the migration rate is increased to 
0.075. Their accuracy is still above 50% even when the migration rate is increased to 0.20. 

In both scenarios, STEM always performs the worst in the presence of gene flow. In 
the short speciation interval senarios, *BEAST and STEST (M0) outperforms the other 
methods and their performance curves are very close to each other. In the long speciation 
interval scenarios, the same thing happens to STEST (Ml) and STEST (SIM3s). The 
performance curves in the long speciation interval scenarios are steeper than in the short 
speciation interval scenarios. 
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Results for Simulation Study 2 

Results from the simulation study 2 are given in Tables [5] and [6] and are plotted in 
Figures|9]- [38} 

C1~C9. — In the cases when T\ = 1, all methods perform well in the presence of gene flow 
in terms of the percentage of correct estimates (> 86%, see Fig. [9]). All methods' 
performance decreases as the migration rate increases. STEM's estimation accuracy 
quickly goes down to 0% correct when the migration rate is 0.025 while all the other 
methods still remain above 55% when the migration rate is 0.10. STEST (MO) outperforms 
all of the other methods when the migration rate is small (0 ~ 0.075). Its accuracy curve 
starts to drop below STEST (Ml)'s when the migration rate is larger than 0.1 and starts 
to drop below STEST (SIM3s)'s when the migration rate is larger than 0.125. STEST 
(Ml)'s accuracy is consistently better than STEST (SIM3s) with a ~15% difference. When 
the migration rate is increased to 0.20, STEST (Ml) performs the best with a 47% 
accuracy. STEST (SIM3s) is the second best with a 26% accuracy (Tableland Fig. |9^i). 

The average Robinson-Foulds distances (RF distances, see Robinson and Foulds, 
1981) between all the estimates and the correct tree (aRFdac) for different methods and 
different migration rates are plotted in Figure [9]d. Note that RF distances is designed to 
measure the distances between unrooted trees. It is possible that two different rooted trees 
have RF distance 0. Nontheless, RF distance is still the most popular metric for rooted 
trees. All methods' aRFdac increases as the migration rate increases. Even though 
STEM's estimation accuracy decreases to 0 when the migration rate is 0.025, its aRFdac is 
just 6.98. This distance continues to increase as the migration rate increases. It attains 
values above 10 when the migration rate is larger than 0.1. STEST (M0)'s aRFdac is the 
smallest when the migration rate is small (< 0.10). It becomes larger than STEST (Ml)'s 
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when the migration rate is larger than 0.125 (0.75 > 0.70 when m = 0.125), and becomes 
larger than STEST (SIM3s)'s when the migration rate is larger than 0.175 (2.32 > 1.95 
when m = 1.75). STEST (Ml)'s aRFdac is smaller than STEST (SIM3s)'s under all 
settings. It never exceeds 0.83 and STEST (SIM3s)'s aRFdac never exceeds 2.17 (Table [5 



The average RF distances between the incorrect estimates and the correct tree 
(aRFdic) for different migration rates are plotted in Figure [9^. All methods' aRFdic 
increases as the migration rate increases. STEM's aRFdic curve is very similar to its 
aRFdac curve except that the minimum possible value it attains is 2 instead of 0. STEST 
(Ml) and STEST (SIM3s)'s aRFdic increases very slowly as the migration rate increases. 
Their aRFdic curves are similar to each other. STEST (Ml)'s aRFdic never exceeds 2.59 
and STEST (SIM3s)'s aRFdic never exceeds 3.06 when migration rate falls in the interval 
(0,0.20). STEST (M0)'s aRFdics, ranging from 2.56 to 5.16, is always larger than STEST 
(Ml)'s and STEST (SIM3s)'s. 



Figures S1^S4 are the histograms showing the frequency of the RF distances for 
species tree estimates using different methods. When migration rate is 0, most of the 
estimates have zero RF distances to the correct tree. The histogram shows a unimodal 



distribution with the peak at the RF distance 0 (See Figs. SliS2iS3iS4i). As the 
migration rate increases, more and more estimates have large distances to the correct tree. 
The distribution first becomes multimodal with multiple short peaks or uniform (e.g., see 
Fig. |S1|j), and then becomes unimodal again with the peak at a high RF distance value 



e.g., see Fig. |Sl)). This process is observed in STEM shown in Figure SI All the methods 



seem to follow a similar trend. For STEST (M0) and STEST (SIM3s), their distributions 



are about to become multimodal when the migration rate is increased to 0.20 (see Fig. S2 



and Fig. S4 ). But STEST (Ml)'s distribution remains unimodal with the peak at RF 



distance 0 in all cases we investigate (Fig. S3 ) 
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Dlr^D9. — In the cases when T\ = 2, STEST (MO) and STEST (Ml) perform very well in 
the presence of gene flow in terms of the percentage of correct estimates (> 96%, see 
Fig. [9]). STEM and STEST (SIM3s) also performs well with 78% correct and 75% correct, 
respecitively, in estimation accuracy. All methods' performance decreases as the migration 
rate increases. Again, STEM's estimation accuracy quickly goes down to 0% correct when 
the migration rate is 0.025. STEST (Ml)'s estimation accuracy also decreases dramatically 
as the migration rate increases. It goes down to 1% at the migration rate 0.125. STEST 
(SIM3s)'s estimation accuracy stays 69 ~75% when the migration rate is smaller than 0.1. 
STEST (Ml)'s estimation accuracy decreases from 96% to 71% when the migration rate is 
increased from 0 to 0.075. However, STEST (SIM3s) and STEST (Ml)'s performance 
curves become similar to each other when the migration rate is larger than 0.1. They drop 
from 60% to below 20% as the migration rate increases from 0.10 to 0.20 (Tableland 
Fig.gi). 

The aRFdac for different methods under different migration rates are plotted in 
Figure [9^. All methods' aRFdac increases as the migration rate increases. The trend is 
more obvious than the previous cases. STEM's aRFdac attains values above 10 at 
migration rate as small as 0.025. When the migration rate is smaller than 0.05, STEST 
(M0), STEST (Ml) and STEST (SIM3s)'s aRFdacs are small (< 0.4) and do not increase a 
lot. STEST (M0)'s aRFdac curve starts to have a higher increasing rate when the 
migration rate increases from 0.05 to 0.20. Its aRFdac value is above 9 when the migration 
rate is increased to 0.175. STEST (Ml) and STEST (SIM3s)'s aRFdac curves are again 
very similar to each other and increase slowly as the migration rate increases. Their 
aRFdac values do not exceed 3.09 even when the migration rate is 0.20. 

The aRFdic for different methods under different migration rates are plotted in 
Figure |9]F. All methods' aRFdic increases as the migration rate increases. Again, STEM's 
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aRFdic curve is very similar to its aRFdac curve except that the minimum possible value it 
attains is 2 instead of 0. It attains values above 10 at the migration rate 0.025. STEST 
(MO)'s aRFdac also increases very fast. Its aRFdic value becomes larger than 8 at 
migration rate 0.15 and larger than 9 at migration rate 0.175. STEST (Ml) and STEST 
(SIM3s)'s aRFdic curves are again similar to each other. Their values increases slowly as 
the migration rate increases and stay smaller than 3 at migration rate smaller than 0.15 
and smaller than 4 at migration rate smaller than 0.20. 



Figures S5 - S8 are the histograms showing the frequency of the RF distances for 
species tree estimates using different methods. Similarly to the cases when t\ — 1, most of 
the estimates have zero RF distances to the correct tree at migration rate 0. The 
histograms show a unimodal distribution with the peak at the RF distance 0 (See 



Figs. S5i, S6i .S7i. S8i). As the migration rate increases, more and more estimates have 



large distances to the correct tree. The distribution first becomes multimodal with multiple 



short peaks or even uniform (e.g., see Fig. S6:), and then becomes unimodal again with 



the peak at a high RF distance value (e.g., see Fig. [S6l). This whole process is observed in 



STEST (M0) shown in Figure S6 All methods seem to follow this trend. STEM skips the 
multimodal or uniform stage. STEST (SIM3s)'s distribution is about to become 
multimodal when the migration rate is increased to 0.20. The distribution for STEST (Ml) 
remains unimodal with the peak at RF distance 0 under all settings we investigate 



(Fig. S3) 



Empirical Study Results 

The species tree estimates obtained using STEST (M0) and STEST (Ml) both agree with 
the species tree topology obtained by Rannala and Yang| ( |2003 ) , which is (((H,C),G),0). 



The speciation time estimates f HC , thcg an d t hcgo are listed in Table [7] in units of 
expected number of mutations per site, as in Rannala and Yang (2003). The running times 
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are 14 seconds and 95 seconds for STEST (MO) and STEST (Ml), respectively, on a Linux 
machine with two eight core Xeon E5-2680 (2.8 GHz) CPUs and 384 GB ram. 

When attempting to use STEST (SIM3s) to estimate the species tree, we found that 
the SIM3s method was not able to estimate the speciation time when both taxa R (the 
outgroup) and O were included. Thus Table [7] gives the speciation time estimates for the 
other divergences, and only a lower bound on the speciation time for the split between taxa 
H, C, G and O. The total time to carry out this analysis was 130 seconds on the same 
machine. 

Discussion and Conclusion 



Simulation Study 1. — When gene flow does not exist, STEST (Ml) and STEST (SIM3s) 
perform worse than *BEAST and STEST (MO) in the short speciation interval scenario 
(Fig. [8^), which can be explained by the fact that short n causes many deep coalescent 
events. The methods Ml and SIM3s may mistakenly attribute the species tree-gene tree 
conflicts partially to gene flow. This is possible because Ml and SIM3s only model the 
in-population processes within two focal populations (e.g., when n is estimated, population 
1 and population 2 are the two focal populations) and ignore the migration of any alleles 
between the two focal populations and any other populations (e.g., one allele is moved to 
population 123 at the time interval r 2 , see Fig. [6]). In the long speciation interval scenario, 
7~i = 2 is long enough, which implies that there are not so many deep coalescent events. All 
the methods have similar good performance (estimation accuracy all above 94%, Fig. [8Jd) . 
This also demonstrates the influence of deep coalescence in phylogenetic inference problems. 

When gene flow does exist, *BEAST and STEST (MO) perform excellently in the 
short speciation interval scenarios (estimation accuracy above 86%, see Fig. [8^). This is 
because they assume the gene tree species tree conflicts are exclusively due to deep 
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coalescence and short speciation interval causes many deep coalescent events, which exert 
much more influence on these conflicts than gene flow. Their performance decreases as the 
migration rate increases because the incongruence between gene trees and the species tree 
is influenced more and more by gene flow. The reason why STEST (Ml) and STEST 
(SIM3s) do not perform as well as *BEAST and STEST (MO) might be the same as in the 
cases when there is no gene flow, i.e., gene flow occurs between two focal populations and 
the other populations. There are two possible cases in the presence of gene flow: the first is 
that before time Ti, gene flow occurs between the two focal populations and other 
populations in both directions, the second is that at time T2, one linage is moved from an 
out population into the focal populations. The difference in STEST (Ml) and STEST 
(SIM3s)'s performance may be due to either Ml and SIM3s' different ability to estimate 
speciation times, or their different tolerance to the violation of their assumptions in our 
approach. Further study can be designed to find out which reason is more plausible. For 
now, we only want to evaluate the performance of the idea to estimate species trees. 
STEM's performance curve is different. It decreases dramatically as migration rate 
increases. The reason is that as gene flow increases, the minimal coalescent time tends to 
zero. Therefore, STEM produces a lot of unresolved species trees, which implies that the 
data doesn't have enough information for species tree estimation through STEM's 
approach. 

In the long speciation interval scenario, deep coalescence is no longer a problem. 
The incongruence between gene trees and species trees is mostly due to gene flow. 
Therefore, STEST (Ml) and STEST (SIM3s) outperform *BEAST, STEST (MO) and 
STEM almost everywhere. Different from the short speciation interval scenarios, STEST 
(Ml) and STEST (SIM3s)'s performance curves are very similar to each other, which 
indicates that their difference in performance is related to the many deep coalescent events 
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in the previous scenarios. In most of the cases, *BEAST performs better than STEST 
(MO), which performs better than STEM. The performance curves are also decreasing as 
the migration rate increases. However, the slope of the performance curve is steeper than 
that in the short speciation interval scenario. The possible reason is that longer speciation 
interval allows more migration events when the migration rates are the same. Therefore, in 
the long speciation interval scenario, the same amount of increment in migration rate 
produces a larger increase in the number of migration events, which makes the performance 
of these methods decrease more. 

There are multiple possible reasons why the performance of STEST (Ml) and 
STEST (SIM3s) decrease when the migration rate increases. The first one is that the 
assumption that no other populations are exchanging genes with the focal populations in 
the Ml model and SIM3s model is violated. When there are not so many migration events, 
such violation does not matter a lot. However, when the speciation interval is long and 
migration rate between the focal populations and the unfocal populations is large, these 
methods are no longer applicable to estimate speciation times between two species. The 
second possible reason is that when the migration rate is large, the likelihood surface 
becomes bizarre. Therefore it is more difficult to locate the global maximum of the 
likelihood function. To improve this, Ml and SIM3s could be replaced by better methods 
(if any were developed) to estimate speciation times with the presence of gene flow. 

Simulation Study 2. — When gene flow does not exist, all methods perform well in the cases 
when n = 1 is moderate. When n = 2 is long, STEST (MO) and STEST (Ml) still 
perform very well in terms of estimation accuracy (> 95%). However, STEST (SIM3s) and 
STEM's estimation accuracy fall slightly below 80%, which is different from the first 
simulation study, in which case all methods have good performance. One possible reason is 
that r 2 — 7~i = 1 in this case and r 2 — t\ = 2 in the first simulation study. Thus, long 
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ancestral speciation intervals might be helpful in species tree estimation in the presence of 
gene flow. Another possible reason is that larger trees are more difficult to estimate. 

When gene flow does exist and T\ — 1 is moderate, STEST (Ml) performs the best 
when the migration rate is small (< 0.75), which implies deep coalescence is the main 
reason for the gene tree-species tree conflict. As expected, STEST (Ml)'s performance 
starts to fall behind STEST (Ml) and STEST (SIM3s) when the migration rate is large 
enough, which means that gene flow becomes the overwhelming factor for the conflict. 
STEST (Ml) again performs consistently better than STEST (SIM3s). This could be the 
same reason as in simulation study 1. STEM again performs the best. This could also be 
explained by the same reason as in simulation study 1. 

When T\ = 2 is large, STEST (Ml) outperforms all the other methods since more 
migration events are allowed even when migration rate is small. STEST (SIM3s) performs 
worse than STEST (Ml) when the migration rate is smaller than 0.10. This might have the 
same reason as in the T\ — 1 cases. When the migration rate is larger than 0.10, STEST 
(SIM3s) and STEST (Ml) have similar performance. Their accuracy curves decrease more 
dramtically than the T\ — \ cases, because larger T\ allows more migration events for the 
same increase in migration rate. It also decreases more dramtically than the long speciation 
interval scenarios in simulation study 1. This is because in simulation study 1, only one 
population exchanges genes with the two focal populations, while in this case, there are 5 
more populations exchanging genes with the two focal populations before t\. STEST 
(MO)'s performance decreases much more dramatically than in the previous cases, which 
shows it cannot deal with data subject to large migration rates. When migration rates are 
very large, speciation boundaries are not clear, and thus this behavior is not unexpected. 



Empirical Study. — In the empirical study, all of the STEST-based methods STEST (MO), 
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STEST (Ml) and STEST (SIM3s) yield the correct tree topology within three minutes. 
The rapid and accurate performance of these methods for these data demonstrates the 
potential for the application of these methods to large-scale empirical data. 

Conclusion. — To summarize, STEST (MO) provides an alternative approach to *BEAST 
for estimation of species trees in the presence of deep coalescence. It is much faster and has 
a comparable estimation accuracy. When the data follow the n-island migration model, 
STEST (MO) is appropriate for species tree estimation when the speciation interval for 
migration is short. When the speciation interval for migration is moderate, STEST (MO) is 
recommended for data subject to small migration rates and STEST (Ml) is recommended 
for data subject to large migration rates. When the speciation interval for migration is 
long, STEST (Ml) is the better choice. 

There are multiple ways to improve the performance of our methods. One way, for 
example, is to develop better speciation time estimation methods. Our idea is to use the 
speciation time estimates as distances to estimate species trees. The better quality the 
speciation time estimates are, the better accuracy our method will have. Another way is to 
find the best strategy to accommodate different and more informative data types. For 
example, extension of the SIM3s, MO and Ml methods to handle multiple sampled lineages 
per species would allow our method to be applied in this setting. 
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Table 1: Settings for simulation study l a . 





Migration Rate 
0 fe 0.025 0.05 fe 0.075 0.1 6 0.125 0.15 0.175 0.2 b 


Speciation Interval ^ 


Al A2 A3 A4 A5 A6 A7 A8 A9 
Bl B2 B3 B4 B5 B6 B7 B8 B9 



a Each entry in this table provides a label for a set of model parameters, e.g., Al 
corresponds to the setting in which the difference r 3 — r 2 and r 2 — T\ are both 0.5 and 
all migration rates are 0. 

b indicates the parameter settings for *BEAST. 
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Table 3: Simulation 1 results (short speciation interval scenario). 



% a 


Al 


A2 


A3 


A4 


A5 


A6 


A7 


A8 


A9 


*BEAST 


100 




100 




100 








89 


STEM 


85 


46 


56 


35 


49 


19 


18 


14 


28 


STEST (MO) 


100 


99 


99 


99 


97 


91 


90 


92 


86 


STEST (Ml) 


86 


83 


87 


89 


82 


75 


73 


77 


64 


STEST (SIM3s) 


71 


73 


78 


79 


69 


63 


66 


71 


65 



a Each entry is the percentage of the correct estimates. 
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Table 4: Simulation 1 results (long speciation interval scenario). 



% a 


Bl 


B2 


B3 


B4 


B5 


B6 


B7 


B8 


B9 


*BEAST 


94 




91 




54 








28 


STEM 


100 


46 


44 


35 


45 


19 


18 


14 


25 


STEST (MO) 


100 


99 


80 


55 


39 


34 


32 


32 


24 


STEST (Ml) 


98 


99 


99 


95 


85 


76 


76 


72 


55 


STEST (SIM3s) 


94 


99 


100 


91 


83 


78 


71 


61 


68 



Each entry is the percentage of the correct estimates. 
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Table 7: Speciation time estimates for n, T2 and T3 





thc 


T~HCG 


T~HCGO 


STEST (MO) 


0.00384 


0.00571 


0.01279 


STEST (Ml) 


0.00396 


0.00572 


0.01331 


STEST (SIM3s) 


0.00384 


0.00569 


> 0.0223 
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Figure Captions 

Figure 1. Factors responsible for the incongruence of gene trees and the species tree, a) 
Deep coalescence, b) Gene flow, c) A-gene duplication, B-gene loss. 

Figure 2. Different gene histories given a three-taxon species tree, a) G a : linages sampled 
from population 1 and 2 coalesce in the ancestral population 12 during the time interval t. 
b) Gb. both coalescent events happen in the ancestral population 123 with gene tree 
((1,2), 3). c) G c : both coalescent events happen in the ancestral population 123 with gene 
tree ((2,3),1). d) Gd- both coalescent events happen in the ancestral population 123 with 
gene tree ((1,3), 2). 

Figure 3. A two population IM model, a) Model species tree and parameters, b) 
Illustration of state change: A. S(l,l) to S(2,0) by migration, B. S(2,0) to S(1,0) by 
coalescence, C. S(1,0) to S(0,1) by migration. 

Figure 4. Zhu and Yang's SIM3s model, a) Model species tree and parameters, b) H± a : 
Coalescence of 1,2 happens first in population 1, t\ < t\. c) FL\b- Coalescence of 1,2 
happens first in population 2, t\ < T\. d) H\ c : Coalescence of 1,2 happens first in 
population 12, T\ < t\ < r 0 . e) B.\d- Coalescence of 1,2 happens first in population 123, 
To < t\ < to- f) H 2 : Coalescence of 2,3 happens first in population 123, r 0 < ti < t 0 . g) H 3 : 
Coalescence of 1,3 happens first in population 123, r 0 < t x < t 0 . 

Figure 5. Illustration of tree reconstruction algorithm. 

Figure 6. Model species tree and parameters for simulation study 1. 

Figure 7. Model species tree and parameters for simulation study 2. 

Figure 8. Results plot for simulation study 1. Black: Results from *BEAST; Red: Results 
from STEST (MO); Purple: Results from STEST (Ml); Pink: Results from STEST 
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(SIM3s); Green: Results from STEM, a) is the percentage of the correct estimates vs. the 
magnitude of the gene flow used to generate data in the short speciation interval case 
(Al~A9). b) is the percentage of the correct estimates vs. the magnitude of the gene flow 
used to generate data in the long speciation interval case (B1~B9). 

Figure 9. Results plot for simulation study 2. Green: Results from STEM; Red: Results 
from STEST (MO); Black: Results from STEST (Ml); Purple: Results from STEST 
(SIM3s). a) is the percentage of the correct estimates vs. the magnitude of the gene flow 
used to generate data for n = 1 (C1~C9). b) is the average Robinson-Foulds distances 
between all estimates and the correct tree vs. the magnitude of the gene flow used to 
generate data for n = 1 (C1~C9). c) is the average Robinson-Foulds distances between 
incorrect estimates and the correct tree vs. the magnitude of the gene flow used to 
generate data for n = 1 (C1~C9). d) is the percentage of the correct estimates vs. the 
magnitude of the gene flow used to generate data for T\ — 2 (D1~D9). e) is the average 
Robinson-Foulds distances between all estimates and the correct tree vs. the magnitude of 
the gene flow used to generate data for T\ = 2 (D1~D9). f) is the average Robinson-Foulds 
distances between incorrect estimates and the correct tree vs. the magnitude of the gene 
flow used to generate data for T\ — 2 (D1~D9). 
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Supplementary Figure Captions 

Figure SI. Frequency histogram showing the distribution of Robinson- Foulds distances 
between estimates using STEM and the correct tree under settings: a) CI, b) C2, c) C3, d) 
C4, e) C5, f) C6, g) C7, h) C8, and i) C9. 

Figure S2. Frequency histogram showing the distribution of Robinson- Foulds distances 
between estimates using STEST (MO) and the correct tree under settings: a) CI, b) C2, c) 
C3, d) C4, e) C5, f) C6, g) C7, h) C8, and i) C9. 

Figure S3. Frequency histogram showing the distribution of Robinson- Foulds distances 
between estimates using STEST (Ml) and the correct tree under settings: a) CI, b) C2, c) 
C3, d) C4, e) C5, f) C6, g) C7, h) C8, and i) C9. 

Figure S4. Frequency histogram showing the distribution of Robinson-Foulds distances 
between estimates using STEST (SIM3s) and the correct tree under settings: a) CI, b) C2, 
c) C3, d) C4, e) C5, f) C6, g) C7, h) C8, and i) C9. 

Figure S5. Frequency histogram showing the distribution of Robinson-Foulds distances 
between estimates using STEM and the correct tree under settings: a) Dl, b) D2, c) D3, d) 
D4, e) D5, f) D6, g) D7, h) D8, and i) D9. 

Figure S6. Frequency histogram showing the distribution of Robinson-Foulds distances 
between estimates using STEST (MO) and the correct tree under settings: a) Dl, b) D2, c) 
D3, d) D4, e) D5, f) D6, g) D7, h) D8, and i) D9. 

Figure S7. Frequency histogram showing the distribution of Robinson-Foulds distances 
between estimates using STEST (Ml) and the correct tree under settings: a) Dl, b) D2, c) 
D3, d) D4, e) D5, f) D6, g) D7, h) D8, and i) D9. 

Figure S8. Frequency histogram showing the distribution of Robinson-Foulds distances 
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between estimates using STEST (SIM3s) and the correct tree under settings: a) Dl, b) D2, 
c) D3, d) D4, e) D5, f) D6, g) D7, h) D8, and i) D9. 
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Figure 1: Factors responsible for the incongruence of gene trees and the species tree, a) 
Deep coalescence, b) Gene flow, c) A-gene duplication, B-gene loss. 
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Figure 2: Different gene histories given a three-taxon species tree, a) G a : linages sampled 
from population 1 and 2 coalesce in the ancestral population 12 during the time interval 
t. b) Gb- both coalescent events happen in the ancestral population 123 with gene tree 
((1,2), 3). c) G c : both coalescent events happen in the ancestral population 123 with gene 
tree ((2,3),1). d) Gd'- both coalescent events happen in the ancestral population 123 with 
gene tree ((1,3), 2). 



Downloaded from http://biorxiv.org/on September 18, 2014 




Figure 3: A two population IM model, a) Model species tree and parameters, b) Illustration 
of state change: A. S(l,l) to S(2,0) by migration, B. S(2,0) to S(1,0) by coalescence, C. 
S(1,0) to S(0,1) by migration. 
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Figure 6: Model species tree and parameters for simulation study 1. 
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ure 7: Model species tree and parameters for simulation study 2. 
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Figure 8: Results plot for simulation study 1. Black: Results from *BEAST; Red: Results 
from STEST (MO); Purple: Results from STEST (Ml); Pink: Results from STEST (SIM3s); 
Green: Results from STEM, a) is the percentage of the correct estimates vs. the magnitude 
of the gene flow used to generate data in the short speciation interval case (Al~A9). b) is 
the percentage of the correct estimates vs. the magnitude of the gene flow used to generate 
data in the long speciation interval case (B1~B9). 
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Figure SI: Frequency histogram showing the distribution of Robinson- Foulds distances be- 
tween estimates using STEM and the correct tree under settings: a) CI, b) C2, c) C3, d) 
C4, e) C5, f) C6, g) C7, h) C8, and i) C9. 
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Figure S2: Frequency histogram showing the distribution of Robinson- Foulds distances be- 
tween estimates using STEST (MO) and the correct tree under settings: a) CI, b) C2, c) C3, 
d) C4, e) C5, f) C6, g) C7, h) C8, and i) C9. 
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Figure S3: Frequency histogram showing the distribution of Robinson- Foulds distances be- 
tween estimates using STEST (Ml) and the correct tree under settings: a) CI, b) C2, c) C3, 
d) C4, e) C5, f) C6, g) C7, h) C8, and i) C9. 
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Figure S4: Frequency histogram showing the distribution of Robinson-Foulds distances be- 
tween estimates using STEST (SIM3s) and the correct tree under settings: a) CI, b) C2, c) 
C3, d) C4, e) C5, f) C6, g) C7, h) C8, and i) C9. 
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Figure S5: Frequency histogram showing the distribution of Robinson- Foulds distances be- 
tween estimates using STEM and the correct tree under settings: a) Dl, b) D2, c) D3, d) 
D4, e) D5, f) D6, g) D7, h) D8, and i) D9. 
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Figure S6: Frequency histogram showing the distribution of Robinson- Foulds distances be- 
tween estimates using STEST (MO) and the correct tree under settings: a) Dl, b) D2, c) 
D3, d) D4, e) D5, f) D6, g) D7, h) D8, and i) D9. 
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Figure S7: Frequency histogram showing the distribution of Robinson-Foulds distances be- 
tween estimates using STEST (Ml) and the correct tree under settings: a) Dl, b) D2, c) 
D3, d) D4, e) D5, f) D6, g) D7, h) D8, and i) D9. 
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Figure S8: Frequency histogram showing the distribution of Robinson-Foulds distances be- 
tween estimates using STEST (SIM3s) and the correct tree under settings: a) Dl, b) D2, c) 
D3, d) D4, e) D5, f) D6, g) D7, h) D8, and i) D9. 



